The following information contains data collected on the GDP of all countries and an independent international calculation of an absence of corruption index from a variety of factors including government transparency, public official use of state dollars for personal gain and general mistrust in government officials being honest in their actions between the years 1975 and 2021. Higher absence corruption index signals lower perceived corruption.
The following ER diagram represents how our data is structured
Code
# ER Diagramknitr::include_graphics(here::here("Figures", "ERDiagram.png"))
Variable
Type
Description
🔑Country
String
Full Name of a Country
🔑Year
Int
Year This data was collected
Country’s Absence of Corruption Index
Optional(Double)
A index measuring how corrupt a perceived country’s government is. Lower value means less trust and more corruption. NA means no data is unavailable for that year. Please check the section about NA values for some possible reasons
GDP Per Captia
Double
Calculated GDP per Captia (GDP/Population) in USD
The two quantitative variables our team have decided to focus on are GDP per Capita (GDP/Population) and Absence of Corruption as an indexed range. These variables are measured for all 172 countries. Below are the descriptions of each and what they represent, specifically in our data set: GDP per Capita (GDP/Population): This variable represents a countries GDP or Gross Domestic Product for a given year divided by the countries population size. In other words, this variable is simply the per-person GDP for that (Ingabire 2020) country.Absence of Corruption Index: This variable represents the lack of corruption within a countries public administration.(Gapminder 2023) This indicator is determined by assessing public sector corrupt exchanges, public sector theft, executive embezzlement and theft, executive bribery and corrupt exchanges. These values are determined by assessing the extent that public officials within an office administration do not use their power for personal benefit or gain. The scaling of this index is such that a lower value represent a higher absence of corruption, or simply put less corruption. Conversely, a high value for this variable represents a lower absence of corruption, or simply put, more corruption. The categorical variable Country refers to the country that is being observed or referred to when assessing the corresponding values for GDP per Capita and Absence of Corruption. The quantitative variable Year, refers to the given year that corresponds to the other variables observations. Year refers to data collected from January to December for the given year.
Below is a quick summary of the average mean absence corruption index and GDP per Capita out of every country for every year
Code
full_data |>group_by(country) |>mutate(across(.cols =c(Absence_Corruption_Index, GDP_Per_Capita), .fns =~mean(.x, na.rm =TRUE))) |>group_by(Year) |>summary() |>unclass() |>as_tibble() |>select(Absence_Corruption_Index, GDP_Per_Capita) |>drop_na() |>rename("Absence Corruption Index"= Absence_Corruption_Index, " GDP Per Capita (USD)"= GDP_Per_Capita) |>mutate_all(.funs =~prettyNum(.x, ",")) |>reactable() |>add_title("Summary of Absence Corruption Index & GDP Per Capita Data", font_size =20, align ="center")
Summary of
Absence Corruption Index & GDP Per Capita Data
About those Pesky NAs
One thing we do want to acknowledge is the presence of NAs in our datasheet, particularly in the corruption index. There could be a multitude of reasons why this that data isn’t present for a country in a particular year. Some examples are that the data was lost, data was measured or calculated incorrectly, some disaster prevented the data from being collected, there is an insufficient stability for determination to occur or that the government doesn’t want the public to know about their corruption (therefore attempting to censor it). All these factors, some randomized and some not, will be taken into our consideration when determining the relationship between corruption index and GDP per capita.
Possible Hypothesis
Explanatory Variable : Absence of Corruption of Index
Response Variable : GDP Per Capita
We believe that a higher absence of corruption index will lead to a higher GDP per capita. Traditionally, our more stable and democratic forms of governments have higher standards of living, less internal and or external conflicts, equal opportunities for all individuals, and large amounts of well fares for their constituency. Furthermore, in general, they have stronger economies that are more reliant to wild economic swings. For instance much of modern Europe has a high GDP per capita because of its modernization, democratic governments and tight integration with the EU.
Data Visualizations
Code
# Renames the year from xYYYY -> YYYYfull_data <- full_data |>mutate(Year =as.numeric(str_remove_all(Year, "^x")))
History of Absence Of Corruption vs GDP per Capita
Code
full_data |># Drops NA values cant be plotted...drop_na() |>ggplot(aes(x = Absence_Corruption_Index, y = GDP_Per_Capita, color = country)) +# Plots each country's datageom_point(alpha =0.7, show.legend =FALSE) +# Drops the colour guide wouldn't be helpful for 172 countries...guides(color ="none") +# Seperates by regionfacet_wrap(~region) +# Makes a regression line for easier viewinggeom_smooth(mapping =aes(x = Absence_Corruption_Index, y = GDP_Per_Capita, color ="blue"), method ="lm", se =FALSE) +# Tittle stufflabs(title ="Absence of Corruption vs GDP per Capita", subtitle ="Year: {frame_time}", x ="Absence of Corruption", y ="GDP Per Capita (USD)") +theme(axis.title.y =element_text(vjust =0.5, angle =0), plot.title =element_text(hjust =0.5), plot.subtitle =element_text(hjust =0.5)) +scale_y_continuous(labels = comma) +# Animating by yeartransition_time(as.integer(Year)) +ease_aes("linear")
In this animation, we can see the observations and fit line for each year separated by region, with each color representing one country across time. Starting in Europe, we can see a steady rise in GDP per Capita across countries with higher Absence of Corruption over the years, leading to a steeper, or more positive, relationship in recent years. Europe is also home Monaco, a country with a GDP per Capita of over 2 million, and the only country that is not visible across the graphs. The Americas produces an interesting time-lapse because most countries don’t change significantly in GDP per Capita, they slowly become less corrupt, which makes the relationship more positive. The two high outliers in this region are the USA in pink and Canada in yellow, and while they both gradually rise in GDP per Capita, the USA experiences a drop in Absence in Corruption in the late 2010s.
The African plot is the only plot to display a negative relationship at any point, which is likely because the vast majority of countries in this region have extremely low GDP per Capita, so a country with any value on the x-axis can influence the line with a spike in GDP per Capita. In particular, the countries Gabon and Libya had the highest GDP per Capita in the 1980s, despite having Absence of Corruption values in the 20’s, so they heavily increased the trend and turned it negative. But the trend in Africa eventually becomes positive. Finally, the Asia-Pacific region has features similar to all 3 other plots, with many countries close to the x-axis as well as countries with high Absence of Corruption values increasing in GDP per Capita over time. The most interesting feature here is three countries with mediocre values in Absence of Corruption having high GDP per Capita near the beginning, and they fluctuate intensely across the years. These three countries are Qatar, the UAE, and Kuwait. While this happens, Singapore, a country with high Absence of Corruption, rises to the top of the continent, making the trend more positive.
With this animation, we have strong reason to believe our original hypothesis. So now we will test the correlation with regression.
2021 Absence Of Corruption vs GDP per Capita
Below is a plot of countries and their respective Absence of Corruption Index vs. GDP Per Capita in the year 2021. At this point, we note that all of the plots from this point use 2021 as the year. This particular year is our basis, since it is has no NA values and is balanced. Furthermore, it is a recent year, fairly representative of the current times.
Code
full_data |>filter(Year ==2021) |>ggplot(mapping =aes(x = Absence_Corruption_Index, y = GDP_Per_Capita)) +geom_point() +geom_smooth(method ="lm") +labs(title ="Absence of Corruption vs GDP per Capita 2021", x ="Absence of Corruption", y ="GDP per Capita (USD)") +theme(axis.title.y =element_text(vjust =0.5, angle =0), plot.title =element_text(hjust =0.5)) +scale_y_continuous(labels = comma)
The scatterplot and fit line definitely indicate that there is a positive relationship between our variables. Aside from four high outliers, each data point lies relatively close to the fit line. The part that stands out the most is the amount of observations that are bunched up on the x-axis at 0. Many countries in our dataset only have a GDP per Capita in the hundreds or low thousands, so due to how the plot is scaled, they all appear on the same horizontal line. Because of this, we will try a log transformation so these observations are more visible.
Log Transformed 2021 Absence Of Corruption vs GDP per Capita
Code
# With Log Transformationfull_data |>filter(Year ==2021) |>mutate(GDP_Per_Capita =log(GDP_Per_Capita)) |>ggplot(mapping =aes(x = Absence_Corruption_Index, y = GDP_Per_Capita)) +geom_point() +geom_smooth(method ="lm") +labs(title ="Absence of Corruption vs GDP per Capita(Log Transformed)\n 2021", x ="Absence of Corruption", y ="Log of GDP per Capita (USD)") +theme(axis.title.y =element_text(vjust =0.5, angle =0), plot.title =element_text(hjust =0.5))
The log transformation succeeded in its goal of breaking up the lower GDP per Capita observations. However, it is likely that the strength of our model was decreased. It seems that after the log transformation, the observations broke into two “pools” separated by GDP per Capita. So while there are no outliers on this axis, the vast majority of observations in the lower pool fall below the regression line, and in the higher pool, most observations are above it. To see how much an increase in score of absence of corruption really helps GDP per Captia and how strong the correlation between the explanatory and response variables are lets closely analyze a linear regression between these two vairables.
Data Modeling & Prediction
Creating our Regression Models
While the log transformation gives us an easier way to see the upwards trend between Absence of Corruption Index and the nation’s GDP per capita, it actually does NOT help our linear model. In fact, comparing the computed R-squared values is worse on the log transformed model.
Code
# 2021 Datafull_2021_data <- full_data |>filter(Year ==2021)# 2021 Data with log transformationlog_2021_data <- full_2021_data |>mutate(GDP_Per_Capita =log(GDP_Per_Capita))
Code
full_2021_data_lm <-lm(GDP_Per_Capita ~ Absence_Corruption_Index, full_2021_data)log_2021_data_lm <-lm(GDP_Per_Capita ~ Absence_Corruption_Index, log_2021_data)tibble("R Squared Non Transformed"=summary(full_2021_data_lm)$r.squared, "R Squared Log Transformed"=summary(log_2021_data_lm)$r.squared) |>reactable() |>add_title("Comparison of R Squared of Two Linear Models", font_size =20, align ="center")
Comparison of R Squared of Two Linear Models
With that in mind, we will be sticking with our untransformed data for the remainder of this exploration.
Code
# Removes the transformed log modelsremove(log_2021_data, log_2021_data_lm)# Summary of Linear Model :broom::tidy(full_2021_data_lm) |>rename("Regression Variable"= term, "Model Estimate"= estimate, "Standard Error"= std.error) |>select(c("Regression Variable":"Standard Error")) |>mutate_all(.funs =~prettyNum(.x, ",")) |>reactable() |>add_title("Linear Model Stats", font_size =20, align ="center")
Linear Model Stats
Our linear model comparing different countries in 2021 and their respective Absence of Corruption Index vs their 2021 GDP Per Capita yields us the following results.
\[
\beta_0 = -199747.304 \\
\beta_1 = 8211.695
\] which will give us the linear regression formula of : \[
\widehat{GDP\_Per\_Captia} = -199747.304+8211.695(Absence\_Corruption\_Index)
\]
Analysis
With an \(R^2\) value of 0.50510, approximately 50% of our observed GDP per capita variability is accounted for in our correlation. For every score increase of 1 for a country’s Absence of Corruption Index, their GDP per capita will increase $8211.66.
Checking the Linear Model’s Accuracy
Although we have a mild \(R^2\) relation between Absence of Corruption Index and its response variable GDP per capita, we still want to check that the model fita. One way to check the average residual of our model is below.
With a average residual \(<|10^{-9}|\), this looks pretty promising. However, to confirm our calculations and better visualize the distributions of residuals, lets plot them on a histogram. What we expect is a normal distribution of all residuals reminiscent of the familiar bell curve.
Another way we can also check our linear model is through the average variance or spread of our data set. Ideally we want our observed data to have lower variance as it means its more compactly distrubuted and easier for our model to predict
Considering that we are working with thousands of dollars, a variance of 53038666343.0201 is alright, though its still quite far off our fitted (predicted model) variance of 26790021214.0854, so our model could be alot better. Only simulations will tell how truly accurate this model is.
Simulations and Model
Lastly lets perform a simulation on our data to see how our model performs compared to our observed data. In this simulation we will only be dealing with 2021 data as it has no NAs and represents the state of the world as it is currently in the early 21st century
After grabbing our predicted and observed values for each country from 2021 we will introduce the standard error into our predicted data just to make it more realistic. Afterwards we will compared this modified simulated data to our gathered observed values. If our model is a good fit we should expect, a \(R^2\) value close tp 1 meaning that nearly all our (noise-induced) simulated data closely matches our observed data. We will then run 5000 trials and look at the minimum, average and maximum \(R^2\)
Oh… it seems like our model was as good as we originally predicted. After standard error injection, only about 0.2575815 or 25.7581455 % of observed data can be modeled using our simulated data.
We can more visually see the spread of \(R^2\) we got from our 5000 simulations from plotting the \(R^2\) results on a histogram
Code
options(scipen =100)sim_model_res |>as.data.frame() |>ggplot(mapping =aes(x = sim_model_res)) +geom_histogram(aes(y =after_stat(density)), bins =50) +ggtitle(bquote("Distrubution of "~ R^2~"between Normalized Simulated Data and Observed Data")) +labs(x =bquote(R^2), y ="Percentage of\n Distrubution") +geom_density(colour ="steelblue") +theme(axis.title.y =element_text(angle =0, vjust =0.5), plot.title =element_text(hjust =0.5))
Addressing the NAs into consideration
Earlier we touched on the NAs present in our Absent of Corruption within our data, and listed some potential reasons for their absense. Now we will plot all entries that have NAs and their respective GDP per Capita for that year to verify if our original hypothesis is correct. We will also count the number of NAs from each region to see if we notice any trends between their regions and absence of any curruption information
Code
# Histogram of gpd-per capita by countries with N/A value for Corruption Index Valuena_data <- full_data |>filter(is.na(Absence_Corruption_Index))
Code
ggplot(na_data, aes(x = GDP_Per_Capita)) +geom_histogram() +scale_x_continuous(breaks =seq(min(na_data$GDP_Per_Capita), max(na_data$GDP_Per_Capita), by =50000)) +labs(x ="GDP Per Capita", y ="Number of Countries", title ="GDP Per Capita For Countries Missing Corruption Index Values") +theme(axis.title.y =element_text(angle =0, vjust =0.5), plot.title =element_text(hjust =0.5))
Code
country_region <- full_2021_data |>select(country, region)na_data_rich <- na_data |>filter(GDP_Per_Capita >=50000)na_after_soviet <- na_data |>filter(Year >1991)na_data |>group_by(country) |>summarise(NumbersOfNAEnrties =n()) |>inner_join(country_region, by ="country") |>slice_max(order_by = NumbersOfNAEnrties, n =10, with_ties =FALSE) |>rename("Number of NA Enrties"= NumbersOfNAEnrties, "Region"= region, "Country"= country) |>reactable() |>add_title("Number of NAs from Countries", font_size =20, align ="center") |>add_title("Ordered by Number of NAs", font_size =12, align ="center")
Number of NAs from Countries
Ordered by Number of NAs
Code
na_after_soviet |>group_by(country) |>summarise(NumbersOfNAEnrties =n()) |>inner_join(country_region, by ="country") |>slice_max(order_by = NumbersOfNAEnrties, n =10, with_ties =FALSE) |>rename("Number of NA Enrties"= NumbersOfNAEnrties, "Region"= region, "Country"= country) |>reactable() |>add_title("Number of NAs from Countries\n Post Collapse of the Soviet Union: 1991", font_size =20, align ="center") |>add_title("Ordered by Number of NAs", font_size =12, align ="center")
Number of NAs from Countries
Post Collapse of the Soviet Union: 1991
Ordered by Number of NAs
Code
na_data_rich |>group_by(country) |>summarise(NumbersOfNAEnrties =n()) |>inner_join(country_region, by ="country") |>slice_max(order_by = NumbersOfNAEnrties, n =10, with_ties =FALSE) |>rename("Number of NA Enrties"= NumbersOfNAEnrties, "Region"= region, "Country"= country) |>reactable() |>add_title("Number of NAs from Countries \n w GDP Per Capita over 50K USD", font_size =20, align ="center") |>add_title("Ordered by Number of NAs", font_size =12, align ="center")
Number of NAs from Countries
w GDP Per Capita over 50K USD
Ordered by Number of NAs
The above histogram depicts the distribution of GPD’s Per Capita amongst countries that have missing values (N/A) for the absence of corruption index. This histogram includes all years from the original data set, 1975-2021, and all countries from this grouping are included, to better depict the distribution. The graph shows that there is a very large amount of countries, around 120, that have very low GDP’s per capita (around the several thousand dollars range) and do not contain values for absence of corruption. Aside from this, there appears to be a slightly right-skewed distribution centered at about 150,000, containing the remainder of the data from our sub-set group. Additionally, the max of this right tail hovers around 280,000. Some of this variation could be explained that less fortunate countries has less resources to measure these indexes. Additionally we see that many of these counties are “newer” countries, and only have formed in the past 30 years(Rosenberg 2019) number particularly from previously Soviet states. That being said, the large number of NAs from poorer countries, still possibly suggests that there leaders and governments are deliberate action taken to attempt to hide internal corruption, maybe spending the countries resources for their own benefit instead of reform and social welfare programs. The smaller but still significant “clump” of NAs on the right primary are composed of 2 groups, 1: rich monarchies trying to hide their powers like some in the Middle East as discussed earlier or 2: more present in this data, Eastern Europe nations particularly the Baltics(Åslund 2015) and to a lesser extent Balkans getting really rich as demonstrated in the 3rd table above which shows NA values with GDP per Captia over \(\$50,000\) , \(\frac{9}{10}\) of the countries were of European countries and half of them are now part of the EU(Union 2023), therefore, overall in many regards, the NAs should not be a major issue we need to contend with. For those countries that have been established for a while, a missing NAs does raise quite a suspicion of what their government are up to.
Conclsuion
Overall after all our analysis and models we still believe that an country’s absence of corruption index still has some role in shaping the GDP per Captia of a country. While not as strong and cohesive as we originally hypothesis, the overall however corruption among weather nations does still have us think it the absence of corruption index have a role in influencing ones economy. Nonetheless if this data has shown us anything, it is the fact that there’s no one factor that determines one country’s GDP per Captia, geography, access to natural resources, population, labor force, political structure and stability all contribute in shaping the GDP Per Captia.
---title: "Correlation of Country, Government, Index & GDP Per Capita"author: "Brian Kwong, Connor Gamba, Joey Bailitz, Arneh Begi"engine: knitrformat: html: self-contained: true code-tools: true toc: true theme: Minty df-print: kableeditor: sourcecode-fold: trueexecute: error: true echo: true message: false warning: falsebibliography: references.bib---## Import the Needed LibrariesPlease make sure you have the following libraries installed:1. Tidyverse2. Here3. Janitor4. Styler5. Reactablefmtr6. Gganimate7. Gifski8. formattable```{r}#| output: False# Set output scopeoptions(scipen =0)# Clears environmentremove(list =as.vector(ls()))#| output: Falselibrary(tidyverse)library(janitor)library(styler)library(reactablefmtr)library(gganimate)library(gifski)library(formattable)# Formats documents:style_file(path = here::here("Src", "Correlation_of_Country_Goverment_Index_&_GDP_Captia.qmd"), style = tidyverse_style, strict =TRUE)```# Introduction## BackgroundThe following information contains data collected on the GDP of all countries and an independent international calculation of an absence of corruption index from a variety of factors including government transparency, public official use of state dollars for personal gain and general mistrust in government officials being honest in their actions between the years 1975 and 2021. Higher absence corruption index signals ***lower*** perceived corruption.# Our Data## Importing the Data```{r}corruption_data <-read_csv(here::here("Datasets", "abscorrup_idea.csv"), show_col_types =FALSE) |> janitor::clean_names()gdp_per_cap_data <-read_csv(here::here("Datasets", "gdp_pcap.csv"), show_col_types =FALSE) |> janitor::clean_names()```## Matching Our DaasetSince our corruption only has data for years between \[1975,2021\] we filter our GDP per capita to those years too```{r}gdp_per_cap_data <- gdp_per_cap_data |>select(country, c(x1975:x2021))```## Piviot Data to Long Format```{r}#| warning: falsecorruption_data <- corruption_data |>pivot_longer(cols =c(x1975:x2021), names_to ="Year", values_to ="Absence_Corruption_Index") |>mutate(Absence_Corruption_Index =as.numeric(Absence_Corruption_Index))gdp_per_cap_data <- gdp_per_cap_data |>pivot_longer(cols =c(x1975:x2021), names_to ="Year", values_to ="GDP_Per_Capita") |>mutate(GDP_Per_Capita =if_else(str_detect(GDP_Per_Capita, "k$"), (as.numeric(str_remove_all(GDP_Per_Capita, "k$")) *10000), as.numeric(GDP_Per_Capita)))```## Data Joins```{r}full_data <- corruption_data |>inner_join(gdp_per_cap_data, by =c("country", "Year"))full_data <- full_data |>mutate(country =as_factor(country)) |>mutate(region =as.factor(case_when( country %in%c("Antigua and Barbuda", "Argentina", "Bahamas", "Barbados","Belize", "Bolivia", "Brazil", "Canada", "Chile", "Colombia","Costa Rica", "Cuba", "Dominica", "Dominican Republic","Ecuador", "El Salvador", "Grenada", "Guatemala", "Guyana","Haiti", "Honduras", "Jamaica", "Mexico", "Nicaragua", "Panama","Paraguay", "Peru", "St. Kitts and Nevis", "St. Lucia","St. Vincent and the Grenadines", "Suriname","Trinidad and Tobago", "USA", "Uruguay","Venezuela" ) ~"The Americas", country %in%c("Albania", "Andorra", "Armenia", "Austria", "Azerbaijan","Belarus", "Belgium", "Bosnia and Herzegovina", "Bulgaria","Croatia", "Cyprus", "Czech Republic", "Denmark", "Estonia","Finland", "France", "Georgia", "Germany", "Greece", "Holy See","Hungary", "Iceland", "Ireland", "Italy", "Latvia","Liechtenstein", "Lithuania", "Luxembourg", "North Macedonia","Malta", "Moldova", "Monaco", "Montenegro", "Netherlands","Norway", "Poland", "Portugal", "Romania", "Russia", "San Marino","Serbia", "Slovak Republic", "Slovenia", "Spain", "Sweden","Switzerland", "Turkey", "Ukraine","UK" ) ~"Europe", country %in%c("Algeria", "Angola", "Benin", "Botswana", "Burkina Faso","Burundi", "Cameroon", "Cape Verde", "Central African Republic","Chad", "Comoros", "Congo, Dem. Rep.", "Congo, Rep.","Cote d'Ivoire", "Djibouti", "Egypt", "Equatorial Guinea","Eritrea", "Ethiopia", "Gabon", "Gambia", "Ghana", "Guinea","Guinea-Bissau", "Kenya", "Lesotho", "Liberia", "Libya","Madagascar", "Malawi", "Mali", "Mauritania", "Mauritius","Morocco", "Mozambique", "Namibia", "Niger", "Nigeria", "Rwanda","Sao Tome and Principe", "Senegal", "Seychelles","Sierra Leone", "Somalia", "South Africa", "Sudan", "Eswatini","Tanzania", "Togo", "Tunisia", "Uganda", "Zambia", "Zimbabwe","South Sudan" ) ~"Africa", country %in%c("Afghanistan", "Australia", "Bahrain", "Bangladesh", "Bhutan","Brunei", "Cambodia", "China", "Fiji", "Hong Kong, China","India", "Indonesia", "Iran", "Iraq", "Israel", "Japan", "Jordan","Kazakhstan", "Kiribati", "North Korea", "South Korea", "Kuwait","Kyrgyz Republic", "Lao", "Lebanon", "Malaysia", "Maldives","Marshall Islands", "Micronesia, Fed. Sts.", "Mongolia","Myanmar", "Nauru", "Nepal", "New Zealand", "Oman", "Pakistan","Palau", "Papua New Guinea", "Philippines", "Qatar", "Samoa","Saudi Arabia", "Singapore", "Solomon Islands", "Sri Lanka","Syria", "Taiwan", "Tajikistan", "Thailand", "Timor-Leste","Tonga", "Turkmenistan", "Tuvalu", "UAE","Uzbekistan", "Vanuatu", "Palestine", "Vietnam","Yemen" ) ~"Asia-Pacific" )))```## Understanding Our DataThe following ER diagram represents how our data is structured```{r include-flow-map, fig.align='center'}# ER Diagramknitr::include_graphics(here::here("Figures", "ERDiagram.png"))```| Variable | Type | Description ||--|--|--|| 🔑Country | String | Full Name of a Country || 🔑Year | Int | Year This data was collected| Country's Absence of Corruption Index | Optional(Double) | A index measuring how corrupt a perceived country's government is. Lower value means less trust and more corruption. NA means no data is unavailable for that year. Please check the section about NA values for some possible reasons || GDP Per Captia | Double | Calculated GDP per Captia (GDP/Population) in USDThe two quantitative variables our team have decided to focus on are GDP per Capita (GDP/Population) and Absence of Corruption as an indexed range. These variables are measured for all 172 countries. Below are the descriptions of each and what they represent, specifically in our data set: GDP per Capita (GDP/Population): This variable represents a countries GDP or Gross Domestic Product for a given year divided by the countries population size. In other words, this variable is simply the per-person GDP for that [@ingabire_2020_gapminder] country.Absence of Corruption Index: This variable represents the lack of corruption within a countries public administration.[@gapminder_2023__ideademocracy] This indicator is determined by assessing public sector corrupt exchanges, public sector theft, executive embezzlement and theft, executive bribery and corrupt exchanges. These values are determined by assessing the extent that public officials within an office administration do not use their power for personal benefit or gain. The scaling of this index is such that a lower value represent a higher absence of corruption, or simply put less corruption. Conversely, a high value for this variable represents a lower absence of corruption, or simply put, more corruption. The categorical variable Country refers to the country that is being observed or referred to when assessing the corresponding values for GDP per Capita and Absence of Corruption. The quantitative variable Year, refers to the given year that corresponds to the other variables observations. Year refers to data collected from January to December for the given year.## Summary```{r}#| output: Falsefull_data |>select(country) |>summarise(Count =n_distinct(country))```There are a total of ***172*** countries in this studyBelow is a quick summary of the average mean absence corruption index and GDP per Capita out of every country for every year```{r}full_data |>group_by(country) |>mutate(across(.cols =c(Absence_Corruption_Index, GDP_Per_Capita), .fns =~mean(.x, na.rm =TRUE))) |>group_by(Year) |>summary() |>unclass() |>as_tibble() |>select(Absence_Corruption_Index, GDP_Per_Capita) |>drop_na() |>rename("Absence Corruption Index"= Absence_Corruption_Index, " GDP Per Capita (USD)"= GDP_Per_Capita) |>mutate_all(.funs =~prettyNum(.x, ",")) |>reactable() |>add_title("Summary ofAbsence Corruption Index & GDP Per Capita Data", font_size =20, align ="center")```## About those Pesky NAsOne thing we do want to acknowledge is the presence of NAs in our datasheet, particularly in the corruption index. There could be a multitude of reasons why this that data isn't present for a country in a particular year. Some examples are that the data was lost, data was measured or calculated incorrectly, some disaster prevented the data from being collected, there is an insufficient stability for determination to occur or that the government doesn't want the public to know about their corruption (therefore attempting to censor it). All these factors, some randomized and some not, will be taken into our consideration when determining the relationship between corruption index and GDP per capita.## Possible Hypothesis- Explanatory Variable : Absence of Corruption of Index- Response Variable : GDP Per CapitaWe believe that a higher absence of corruption index will lead to a higher GDP per capita. Traditionally, our more stable and democratic forms of governments have higher standards of living, less internal and or external conflicts, equal opportunities for all individuals, and large amounts of well fares for their constituency. Furthermore, in general, they have stronger economies that are more reliant to wild economic swings. For instance much of modern Europe has a high GDP per capita because of its modernization, democratic governments and tight integration with the EU.# Data Visualizations```{r}# Renames the year from xYYYY -> YYYYfull_data <- full_data |>mutate(Year =as.numeric(str_remove_all(Year, "^x")))```## History of Absence Of Corruption vs GDP per Capita```{r}full_data |># Drops NA values cant be plotted...drop_na() |>ggplot(aes(x = Absence_Corruption_Index, y = GDP_Per_Capita, color = country)) +# Plots each country's datageom_point(alpha =0.7, show.legend =FALSE) +# Drops the colour guide wouldn't be helpful for 172 countries...guides(color ="none") +# Seperates by regionfacet_wrap(~region) +# Makes a regression line for easier viewinggeom_smooth(mapping =aes(x = Absence_Corruption_Index, y = GDP_Per_Capita, color ="blue"), method ="lm", se =FALSE) +# Tittle stufflabs(title ="Absence of Corruption vs GDP per Capita", subtitle ="Year: {frame_time}", x ="Absence of Corruption", y ="GDP Per Capita (USD)") +theme(axis.title.y =element_text(vjust =0.5, angle =0), plot.title =element_text(hjust =0.5), plot.subtitle =element_text(hjust =0.5)) +scale_y_continuous(labels = comma) +# Animating by yeartransition_time(as.integer(Year)) +ease_aes("linear")```In this animation, we can see the observations and fit line for each year separated by region, with each color representing one country across time. Starting in Europe, we can see a steady rise in GDP per Capita across countries with higher Absence of Corruption over the years, leading to a steeper, or more positive, relationship in recent years. Europe is also home Monaco, a country with a GDP per Capita of over 2 million, and the only country that is not visible across the graphs. The Americas produces an interesting time-lapse because most countries don’t change significantly in GDP per Capita, they slowly become less corrupt, which makes the relationship more positive. The two high outliers in this region are the USA in pink and Canada in yellow, and while they both gradually rise in GDP per Capita, the USA experiences a drop in Absence in Corruption in the late 2010s. \nThe African plot is the only plot to display a negative relationship at any point, which is likely because the vast majority of countries in this region have extremely low GDP per Capita, so a country with any value on the x-axis can influence the line with a spike in GDP per Capita. In particular, the countries Gabon and Libya had the highest GDP per Capita in the 1980s, despite having Absence of Corruption values in the 20’s, so they heavily increased the trend and turned it negative. But the trend in Africa eventually becomes positive. Finally, the Asia-Pacific region has features similar to all 3 other plots, with many countries close to the x-axis as well as countries with high Absence of Corruption values increasing in GDP per Capita over time. The most interesting feature here is three countries with mediocre values in Absence of Corruption having high GDP per Capita near the beginning, and they fluctuate intensely across the years. These three countries are Qatar, the UAE, and Kuwait. While this happens, Singapore, a country with high Absence of Corruption, rises to the top of the continent, making the trend more positive. \nWith this animation, we have strong reason to believe our original hypothesis. So now we will test the correlation with regression.\n## 2021 Absence Of Corruption vs GDP per CapitaBelow is a plot of countries and their respective Absence of Corruption Index vs. GDP Per Capita in the year 2021. At this point, we note that all of the plots from this point use 2021 as the year. This particular year is our basis, since it is has no NA values and is balanced. Furthermore, it is a recent year, fairly representative of the current times.```{r}full_data |>filter(Year ==2021) |>ggplot(mapping =aes(x = Absence_Corruption_Index, y = GDP_Per_Capita)) +geom_point() +geom_smooth(method ="lm") +labs(title ="Absence of Corruption vs GDP per Capita 2021", x ="Absence of Corruption", y ="GDP per Capita (USD)") +theme(axis.title.y =element_text(vjust =0.5, angle =0), plot.title =element_text(hjust =0.5)) +scale_y_continuous(labels = comma)``` The scatterplot and fit line definitely indicate that there is a positive relationship between our variables. Aside from four high outliers, each data point lies relatively close to the fit line. The part that stands out the most is the amount of observations that are bunched up on the x-axis at 0. Many countries in our dataset only have a GDP per Capita in the hundreds or low thousands, so due to how the plot is scaled, they all appear on the same horizontal line. Because of this, we will try a log transformation so these observations are more visible.## Log Transformed 2021 Absence Of Corruption vs GDP per Capita```{r}# With Log Transformationfull_data |>filter(Year ==2021) |>mutate(GDP_Per_Capita =log(GDP_Per_Capita)) |>ggplot(mapping =aes(x = Absence_Corruption_Index, y = GDP_Per_Capita)) +geom_point() +geom_smooth(method ="lm") +labs(title ="Absence of Corruption vs GDP per Capita(Log Transformed)\n 2021", x ="Absence of Corruption", y ="Log of GDP per Capita (USD)") +theme(axis.title.y =element_text(vjust =0.5, angle =0), plot.title =element_text(hjust =0.5))``` The log transformation succeeded in its goal of breaking up the lower GDP per Capita observations. However, it is likely that the strength of our model was decreased. It seems that after the log transformation, the observations broke into two “pools” separated by GDP per Capita. So while there are no outliers on this axis, the vast majority of observations in the lower pool fall below the regression line, and in the higher pool, most observations are above it. To see how much an increase in score of absence of corruption really helps GDP per Captia and how strong the correlation between the explanatory and response variables are lets closely analyze a linear regression between these two vairables.# Data Modeling & Prediction## Creating our Regression ModelsWhile the log transformation gives us an easier way to see the upwards trend between Absence of Corruption Index and the nation's GDP per capita, it actually does ***NOT*** help our linear model. In fact, comparing the computed R-squared values is ***worse*** on the log transformed model.```{r}# 2021 Datafull_2021_data <- full_data |>filter(Year ==2021)# 2021 Data with log transformationlog_2021_data <- full_2021_data |>mutate(GDP_Per_Capita =log(GDP_Per_Capita))``````{r}full_2021_data_lm <-lm(GDP_Per_Capita ~ Absence_Corruption_Index, full_2021_data)log_2021_data_lm <-lm(GDP_Per_Capita ~ Absence_Corruption_Index, log_2021_data)tibble("R Squared Non Transformed"=summary(full_2021_data_lm)$r.squared, "R Squared Log Transformed"=summary(log_2021_data_lm)$r.squared) |>reactable() |>add_title("Comparison of R Squared of Two Linear Models", font_size =20, align ="center")```With that in mind, we will be sticking with our ***untransformed data*** for the remainder of this exploration.```{r}# Removes the transformed log modelsremove(log_2021_data, log_2021_data_lm)# Summary of Linear Model :broom::tidy(full_2021_data_lm) |>rename("Regression Variable"= term, "Model Estimate"= estimate, "Standard Error"= std.error) |>select(c("Regression Variable":"Standard Error")) |>mutate_all(.funs =~prettyNum(.x, ",")) |>reactable() |>add_title("Linear Model Stats", font_size =20, align ="center")```Our linear model comparing different countries in 2021 and their respective Absence of Corruption Index vs their 2021 GDP Per Capita yields us the following results.$$\beta_0 = -199747.304 \\\beta_1 = 8211.695$$which will give us the linear regression formula of : $$\widehat{GDP\_Per\_Captia} = -199747.304+8211.695(Absence\_Corruption\_Index)$$## AnalysisWith an $R^2$ value of 0.50510, approximately 50% of our observed GDP per capita variability is accounted for in our correlation. For every score increase of 1 for a country's Absence of Corruption Index, their GDP per capita will increase $8211.66.## Checking the Linear Model's AccuracyAlthough we have a mild $R^2$ relation between Absence of Corruption Index and its response variable GDP per capita, we still want to check that the model fita. One way to check the average residual of our model is below.```{r}full_2021_data_lm_resd <- full_2021_data_lm |> broom::augment() |>select(.resid)full_2021_data_lm_resd |>summarize("Avergae Residual"=mean(.resid)) |>reactable() |>add_title("Model Residual Stats", font_size =20, align ="center")```With a average residual $<|10^{-9}|$, this looks pretty promising. However, to confirm our calculations and better visualize the distributions of residuals, lets plot them on a histogram. What we expect is a normal distribution of all residuals reminiscent of the familiar bell curve.```{r}options(scipen =100)full_2021_data_lm_resd |>ggplot(mapping =aes(x = .resid)) +geom_histogram(aes(y =after_stat(density)), bins =50) +labs(x ="Residuals", y ="Percentage of Distrubution") +geom_density(colour ="steelblue") +theme(axis.title.y =element_text(angle =0, vjust =0.5))options(scipen =0)```Another way we can also check our linear model is through the average variance or spread of our data set. Ideally we want our observed data to have lower variance as it means its more compactly distrubuted and easier for our model to predict```{r}# Variance Table for Observed, fitted and residualfull_2021_data_lm |> broom::augment() |>select(GDP_Per_Capita, .fitted, .resid) |>summarise(across(.cols =c(GDP_Per_Capita, .fitted, .resid), .fns =~var(x = .x))) |>rename("Observed"= GDP_Per_Capita, "Fitted"= .fitted, "Residual"= .resid) |>mutate_all(.funs =~prettyNum(.x, ",")) |>reactable() |>add_title("Model Variance Stats", font_size =20, align ="center")```Considering that we are working with thousands of dollars, a variance of 53038666343.0201 is alright, though its still quite far off our fitted (predicted model) variance of 26790021214.0854, so our model could be alot better. Only simulations will tell how truly accurate this model is.## Simulations and Model Lastly lets perform a simulation on our data to see how our model performs compared to our observed data. In this simulation we will only be dealing with ***2021*** data as it has no NAs and represents the state of the world as it is currently in the early 21st century```{r}observed_2021_data <- full_2021_data |>select(GDP_Per_Capita) |>mutate(GDP_Per_Capita =as.numeric(GDP_Per_Capita)) |>pull()predicted_2021_data <- full_2021_data_lm |>predict()```After grabbing our predicted and observed values for each country from 2021 we will introduce the standard error into our predicted data just to make it more realistic. Afterwards we will compared this modified simulated data to our gathered observed values. If our model is a good fit we should expect, a $R^2$ value close tp 1 meaning that nearly all our (noise-induced) simulated data closely matches our observed data. We will then run 5000 trials and look at the minimum, average and maximum $R^2$```{r}generateRealisticPredictions <-function(predictedValues, standardError) { mutatedPredictedValues <- predictedValues +rnorm(length(predictedValues), 0, standardError)return(mutatedPredictedValues)}findModelFit <-function(linearModel) { newPredicted <- linearModel |> broom::augment() newPredicted <- newPredicted |>select(GDP_Per_Capita, .fitted) |>mutate(.fitted =generateRealisticPredictions(.fitted, sigma(linearModel))) observedPredictedLm <-lm(GDP_Per_Capita ~ .fitted, newPredicted)return(broom::glance(observedPredictedLm) |>select(r.squared) |>pull())}``````{r}set.seed(05081995)sim_model_res <-map_dbl(.x =c(1:5000), .f =~findModelFit(full_2021_data_lm))mean_sim_model_res <-mean(sim_model_res)summary(sim_model_res) |>unclass() |>as.list() |>as.data.frame() |>pivot_longer(cols =c("Min.":"Max."), names_to ="Stat", values_to ="Values") |>mutate(Stat =str_remove(Stat, "[X]")) |>mutate(Stat =str_replace(Stat, "\\.", " ")) |>reactable() |>add_title("Model Residual Simulations Stats", font_size =20, align ="center")```Oh... it seems like our model was as good as we originally predicted. After standard error injection, only about `r mean_sim_model_res` or `r mean_sim_model_res*100` % of observed data can be modeled using our simulated data. We can more visually see the spread of $R^2$ we got from our 5000 simulations from plotting the $R^2$ results on a histogram```{r,fig.width=9}options(scipen =100)sim_model_res |>as.data.frame() |>ggplot(mapping =aes(x = sim_model_res)) +geom_histogram(aes(y =after_stat(density)), bins =50) +ggtitle(bquote("Distrubution of "~ R^2~"between Normalized Simulated Data and Observed Data")) +labs(x =bquote(R^2), y ="Percentage of\n Distrubution") +geom_density(colour ="steelblue") +theme(axis.title.y =element_text(angle =0, vjust =0.5), plot.title =element_text(hjust =0.5))```# Addressing the NAs into considerationEarlier we touched on the NAs present in our Absent of Corruption within our data, and listed some potential reasons for their absense. Now we will plot all entries that have NAs and their respective GDP per Capita for that year to verify if our original hypothesis is correct. We will also count the number of NAs from each region to see if we notice any trends between their regions and absence of any curruption information```{r}# Histogram of gpd-per capita by countries with N/A value for Corruption Index Valuena_data <- full_data |>filter(is.na(Absence_Corruption_Index))``````{r}ggplot(na_data, aes(x = GDP_Per_Capita)) +geom_histogram() +scale_x_continuous(breaks =seq(min(na_data$GDP_Per_Capita), max(na_data$GDP_Per_Capita), by =50000)) +labs(x ="GDP Per Capita", y ="Number of Countries", title ="GDP Per Capita For Countries Missing Corruption Index Values") +theme(axis.title.y =element_text(angle =0, vjust =0.5), plot.title =element_text(hjust =0.5))``````{r}country_region <- full_2021_data |>select(country, region)na_data_rich <- na_data |>filter(GDP_Per_Capita >=50000)na_after_soviet <- na_data |>filter(Year >1991)na_data |>group_by(country) |>summarise(NumbersOfNAEnrties =n()) |>inner_join(country_region, by ="country") |>slice_max(order_by = NumbersOfNAEnrties, n =10, with_ties =FALSE) |>rename("Number of NA Enrties"= NumbersOfNAEnrties, "Region"= region, "Country"= country) |>reactable() |>add_title("Number of NAs from Countries", font_size =20, align ="center") |>add_title("Ordered by Number of NAs", font_size =12, align ="center")na_after_soviet |>group_by(country) |>summarise(NumbersOfNAEnrties =n()) |>inner_join(country_region, by ="country") |>slice_max(order_by = NumbersOfNAEnrties, n =10, with_ties =FALSE) |>rename("Number of NA Enrties"= NumbersOfNAEnrties, "Region"= region, "Country"= country) |>reactable() |>add_title("Number of NAs from Countries\n Post Collapse of the Soviet Union: 1991", font_size =20, align ="center") |>add_title("Ordered by Number of NAs", font_size =12, align ="center")na_data_rich |>group_by(country) |>summarise(NumbersOfNAEnrties =n()) |>inner_join(country_region, by ="country") |>slice_max(order_by = NumbersOfNAEnrties, n =10, with_ties =FALSE) |>rename("Number of NA Enrties"= NumbersOfNAEnrties, "Region"= region, "Country"= country) |>reactable() |>add_title("Number of NAs from Countries \n w GDP Per Capita over 50K USD", font_size =20, align ="center") |>add_title("Ordered by Number of NAs", font_size =12, align ="center")```The above histogram depicts the distribution of GPD's Per Capita amongst countries that have missing values (N/A) for the absence of corruption index. This histogram includes all years from the original data set, 1975-2021, and all countries from this grouping are included, to better depict the distribution. The graph shows that there is a very large amount of countries, around 120, that have very low GDP's per capita (around the several thousand dollars range) and do not contain values for absence of corruption. Aside from this, there appears to be a slightly right-skewed distribution centered at about 150,000, containing the remainder of the data from our sub-set group. Additionally, the max of this right tail hovers around 280,000. Some of this variation could be explained that less fortunate countries has less resources to measure these indexes. Additionally we see that many of these counties are "newer" countries, and only have formed in the past 30 years[@bymattrosenberg_2019_the] number particularly from previously Soviet states. That being said, the large number of NAs from poorer countries, still possibly suggests that there leaders and governments are deliberate action taken to attempt to hide internal corruption, maybe spending the countries resources for their own benefit instead of reform and social welfare programs. The smaller but still significant "clump" of NAs on the right primary are composed of 2 groups, 1: rich monarchies trying to hide their powers like some in the Middle East as discussed earlier or 2: more present in this data, Eastern Europe nations particularly the Baltics[@slund_2015_the] and to a lesser extent Balkans getting really rich as demonstrated in the 3rd table above which shows NA values with GDP per Captia over $\$50,000$ , $\frac{9}{10}$ of the countries were of European countries and half of them are now part of the EU[@theeuropeanunion_2023_easy], therefore, overall in many regards, the NAs should not be a major issue we need to contend with. For those countries that have been established for a while, a missing NAs does raise quite a suspicion of what their government are up to.# ConclsuionOverall after all our analysis and models we still believe that an country's absence of corruption index still has some role in shaping the GDP per Captia of a country. While not as strong and cohesive as we originally hypothesis, the overall however corruption among weather nations does still have us think it the absence of corruption index have a role in influencing ones economy. Nonetheless if this data has shown us anything, it is the fact that there's no one factor that determines one country's GDP per Captia, geography, access to natural resources, population, labor force, political structure and stability all contribute in shaping the GDP Per Captia.